
Conversation


@e06084 (Collaborator) commented Dec 16, 2025

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @e06084, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new, more realistic end-to-end RAG evaluation workflow focused on the FiQA dataset. By loading data directly from HuggingFace and evaluating with the Dingo framework across multiple dimensions, it provides a standardized benchmarking approach that replaces the previous evaluation based on local mock data, significantly improving the accuracy and reproducibility of RAG system evaluation.

Highlights

  • New end-to-end RAG evaluation script: Adds e2e_RAG_eval_with_mockRAG_fiqa.py, which runs a complete end-to-end RAG system evaluation on the FiQA dataset.
  • HuggingFace dataset integration: The new evaluation script downloads the FiQA corpus and test data directly from HuggingFace, keeping the evaluation standardized and reproducible (see the sketch after this list).
  • Comprehensive Dingo evaluation metrics: Uses the Dingo framework to evaluate RAG system outputs across multiple dimensions, including key metrics such as faithfulness, context precision, context recall, context relevancy, and answer relevancy.
  • Documentation update: Updates the Chinese RAG evaluation documentation (docs/rag_evaluation_metrics_zh.md) to reflect the new end-to-end evaluation script and the renamed baseline evaluation script.
  • Old script removal: Removes the old eval_with_mock_rag.py mock RAG evaluation script; its functionality is superseded by the more complete end-to-end evaluation workflow.
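To make the HuggingFace loading step above concrete, here is a minimal sketch using the datasets library. The BeIR/fiqa dataset id, config names, and field names are assumptions based on the public BeIR mirror, not necessarily what e2e_RAG_eval_with_mockRAG_fiqa.py actually does.

    # Minimal sketch: load the FiQA corpus and queries from HuggingFace.
    # The dataset id ("BeIR/fiqa") and field names ("_id", "text") are assumptions.
    from datasets import load_dataset

    corpus = load_dataset("BeIR/fiqa", "corpus", split="corpus")
    queries = load_dataset("BeIR/fiqa", "queries", split="queries")

    # Build a simple id -> text mapping that a retriever can index.
    corpus_texts = {row["_id"]: row["text"] for row in corpus}
    print(f"Loaded {len(corpus_texts)} documents and {len(queries)} queries")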


@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request updates the end-to-end RAG evaluation examples. It removes an old example script and adds a more comprehensive one (e2e_RAG_eval_with_mockRAG_fiqa.py) that performs a full RAG evaluation on the FiQA dataset. The documentation is also updated to reflect these changes. The new example script is well-structured and provides a solid end-to-end workflow. My review focuses on improving its maintainability and robustness. I've suggested making default parameters more consistent, handling file paths more robustly, cleaning up return values, and reducing code duplication. These changes should make the example script easier to understand, use, and maintain.

"""简单的 RAG 系统"""

def __init__(self, llm_client: AsyncOpenAI, retriever: FiQACorpusRetriever,
system_prompt: str = None, model: str = "gpt-3.5-turbo"):

Severity: medium

The default value for the model parameter is "gpt-3.5-turbo", which is inconsistent with the rest of the script that defaults to using "deepseek-chat" from the OPENAI_MODEL environment variable. This can be misleading for anyone using this class. To improve consistency and maintainability, I suggest making the default value consistent with the global configuration.

Suggested change
-    system_prompt: str = None, model: str = "gpt-3.5-turbo"):
+    system_prompt: str = None, model: str = "deepseek-chat"):
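One way to keep the class default aligned with the global configuration is to derive it from the same OPENAI_MODEL environment variable. A minimal sketch, assuming the environment-variable name from the comment above and a hypothetical SimpleRAGSystem class name:

    import os

    # Sketch: default the model to whatever OPENAI_MODEL is set to, falling
    # back to "deepseek-chat" so the class stays consistent with the script.
    DEFAULT_MODEL = os.getenv("OPENAI_MODEL", "deepseek-chat")

    class SimpleRAGSystem:
        def __init__(self, llm_client, retriever,
                     system_prompt: str = None, model: str = DEFAULT_MODEL):
            self.model = model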

Comment on lines +192 to +196
        return {
            "answer": answer,
            "retrieved_documents": docs,
            "context_list": [doc.page_content for doc in docs]
        }

Severity: medium

The query method returns a retrieved_documents field which is a list of langchain_core.documents.Document objects. These objects are not JSON serializable, and this field is not used by the calling function generate_rag_responses (which uses context_list instead). To improve clarity and avoid returning unnecessary, non-serializable data, I suggest removing retrieved_documents from the return dictionary. A similar change should be applied to the return statement in the if not docs: block on lines 167-171.

        return {
            "answer": answer,
            "context_list": [doc.page_content for doc in docs]
        }
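To see why the extra field is a problem, here is a small illustration of the serialization failure, assuming langchain_core is installed; the object contents are only for demonstration:

    import json
    from langchain_core.documents import Document

    doc = Document(page_content="example passage")

    # Document instances are plain Python objects, so json.dumps rejects them...
    try:
        json.dumps({"retrieved_documents": [doc]})
    except TypeError as exc:
        print(f"Not serializable: {exc}")

    # ...while a dict of strings serializes without issue.
    print(json.dumps({"answer": "...", "context_list": [doc.page_content]}))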

Comment on lines +270 to +305
def print_metrics_summary(summary: SummaryModel):
    """Print a summary of the metric statistics."""
    if not summary.metrics_score_stats:
        print("⚠️ No metric statistics available")
        return

    print("\n" + "=" * 80)
    print("📊 RAG evaluation metric statistics")
    print("=" * 80)

    for field_key, metrics in summary.metrics_score_stats.items():
        print(f"\n📁 Field group: {field_key}")
        print("-" * 80)

        for metric_name, stats in metrics.items():
            display_name = metric_name.replace("LLMRAG", "")
            print(f"\n  {display_name}:")
            print(f"    Average: {stats.get('score_average', 0):.2f}")
            print(f"    Min:     {stats.get('score_min', 0):.2f}")
            print(f"    Max:     {stats.get('score_max', 0):.2f}")
            print(f"    Samples: {stats.get('score_count', 0)}")
            if 'score_std_dev' in stats:
                print(f"    Std dev: {stats.get('score_std_dev', 0):.2f}")

        overall_avg = summary.get_metrics_score_overall_average(field_key)
        print(f"\n  🎯 Overall average for this field group: {overall_avg:.2f}")

        metrics_summary = summary.get_metrics_score_summary(field_key)
        sorted_metrics = sorted(metrics_summary.items(), key=lambda x: x[1], reverse=True)

        print("\n  📈 Metric ranking (high to low):")
        for i, (metric_name, avg_score) in enumerate(sorted_metrics, 1):
            display_name = metric_name.replace("LLMRAG", "")
            print(f"    {i}. {display_name}: {avg_score:.2f}")

    print("\n" + "=" * 80)

Severity: medium

The function print_metrics_summary seems to be a utility function that is also present in other example scripts like examples/rag/dataset_rag_eval_baseline.py. To adhere to the DRY (Don't Repeat Yourself) principle and improve maintainability, consider moving this function to a shared utility module (e.g., examples/rag/utils.py) and importing it where needed.
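A possible shape for that refactor, assuming a shared module at examples/rag/utils.py and that the example scripts are run from that directory (both details are illustrative, not part of the PR):

    # examples/rag/utils.py (hypothetical shared helper module)
    def print_metrics_summary(summary):
        """Print metric statistics; shared by the RAG example scripts."""
        ...  # body moved here unchanged from the example scripts

    # examples/rag/e2e_RAG_eval_with_mockRAG_fiqa.py
    from utils import print_metrics_summary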

    elif args.limit:
        output_filename = f"fiqa_end_to_end_rag_output_limit_{args.limit}.jsonl"

    output_path = "test/data/" + output_filename

Severity: medium

Using + for path concatenation is not robust and can lead to issues on different operating systems. It's better to use os.path.join() to construct file paths. Additionally, the output directory "test/data/" is hardcoded, which makes the script less flexible. Consider making this a configurable parameter, for example, via a command-line argument.

    output_path = os.path.join("test/data", output_filename)
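A sketch of what a configurable output directory could look like, using argparse; the --output-dir flag name, its default, and the placeholder filename are assumptions, not part of this PR:

    import argparse
    import os

    parser = argparse.ArgumentParser()
    parser.add_argument("--output-dir", default="test/data",
                        help="Directory for the generated RAG output file")
    args = parser.parse_args()

    output_filename = "fiqa_end_to_end_rag_output.jsonl"  # placeholder; the script derives this from --limit
    os.makedirs(args.output_dir, exist_ok=True)
    output_path = os.path.join(args.output_dir, output_filename)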

@e06084 e06084 merged commit 6e1afc6 into MigoXLab:dev Dec 16, 2025
2 checks passed
tenwanft pushed a commit to tenwanft/dingo that referenced this pull request Dec 24, 2025